Better safe than sorry—Whole-genome sequencing indicates that missense variants are significant in susceptibility to COVID-19

Undoubtedly, genetic factors play an important role in susceptibility and resistance to COVID-19. In this study, we conducted the GWAS analysis. Out of 15,489,173 SNPs, we identified 18,191 significant SNPs for severe and 11,799 SNPs for resistant phenotype, showing that a great number of loci were significant in different COVID-19 representations. The majority of variants were synonymous (60.56% for severe, 58.46% for resistant phenotype) or located in introns (55.77% for severe, 59.83% for resistant phenotype). We identified the most significant SNPs for a severe outcome (in AJAP1 intron) and for COVID resistance (in FIG4 intron). We found no missense variants with a potential causal function on resistance to COVID-19; however, two missense variants were determined as significant a severe phenotype (in PM20D1 and LRP4 exons). None of the aforementioned SNPs and missense variants found in this study have been previously associated with COVID-19.


Introduction
The established risk factors of coronavirus disease 2019 (COVID-19) are advanced age, male sex, and comorbidities [1][2][3], but they do not fully explain the wide spectrum of disease manifestations [4]. Several studies investigated other factors associated with different COVID-19 outcomes that were either environmental, population-based, or genetic. As demonstrated by numerous association studies, the genetics of the host can influence the difference in response to infection and therefore, different clinical presentations of COVID-19. To date, several genetic variants that are associated with susceptibility and resistance to COVID-19 were identified. They are associated with ancestry, virus entry, receptors, and immunological processes. (WGS) with the number of genetic markers being very large, the assumptions underlying the Bonferroni correction are violated, since it assumes complete independence between particular tests, which is not the case in WGS-based SNPs due to very high Linkage Disequilibrium (LD) typically existing between closely linked variants. Moreover, when a phenotype is determined by several genes, a single-SNP model leads to a representation of the genetic background only by one variant. This means that effects of other SNPs which potentially influence genetic variation either contribute to the polygenic part or to the residual part of the modelwhere they are confounded with non-genetic variance components. As a result, GWAS models often lead to poor reproducibility of significant genes reports [26] and the fact that genes with moderate effects are missed due to an elevated type II error caused by conservative multiple testing correction. Since resistance to COVID-19 infection and the risk of a severe infection emerge as oligogenic or even polygenic phenotypes, single-SNP models provide an oversimplified description of their genetic background. An alternative strategy is to use a multi-SNP model, fitting all SNPs simultaneously. The major advantage of the simultaneous inclusion of all SNPs is accounting for the correlation between SNPs, which is due to the LD, introduced into the model by the SNP genotype design matrix. Additionally, using LD in the model may increase the power of detecting correlated causal SNPs [27]. Also, the power of detection of the association between phenotype and genotype in a multi-SNP model is higher than in the single-SNP, where the power is undermined by modest SNP effect sizes, unobserved causal SNPs and a correlation among adjacent SNPs [28]. The next advantage of the multi-SNP model is that the unexplained portion of the genetic variance is lower than in the case of the single-SNP model. Therefore, the main aim of our study was to identify genetic variants (SNPs) marking genes responsible for the increased susceptibility or resistance to COVID-19 infections focusing not only on high-risk variants but also on SNPs with moderate effects, reflecting the possibly oligogenic mode of inheritance of the infection phenotypes.

Sample collection
Blood samples were collected from 1235 individuals across Poland between April 2020 and April 2021. However, for this analysis, a subset of 1076 samples from unrelated individuals was used. Only individuals without diagnosed severe health disorders (till the moment of sample collection), such as cancer, were qualified for this study. The NEWS (National Early Warning Score) scale has been used for the assessment of all hospitalised patients [29]. Within this cohort, the group with a severe outcome of COVID-19 infection (N = 235) was composed of patients with severe, life-threatening cases of COVID-19 infection including respiratory insufficiency, requiring intensive medical care and artificial ventilation, NEWS>5 and the group resistant to infection (N = 306) was composed of volunteers that did not contract the diseases nor develop any symptoms despite being highly exposed to COVID-19 -this group had multiple antibody blood-based tests conducted to confirm they had no antibodies anti-SARS-CoV-2. Detailed information about the cohort, including demographic and clinical features, can be found in Kaja et al. [30].

Ethical statement
All participants, or guardians/parents of the participants under 18, provided written informed consent before the collection of blood samples and filling in the clinical data form, which included a questionnaire about the country of origin and chronic diseases. Collected consent forms are stored in the Central Clinical Hospital of the Ministry of Interior and Administration in Warsaw, as instructed by the General Data Protection Regulation (GDPR) act. The ethical approval for the study was obtained from the Ethics Committee of the Central Clinical Hospital of the Ministry of Interior and Administration in Warsaw (decision nr: 41/ 2020 from 03.04.2020 and 125/2020 from 1.07.2020). The study complies with the 1964 Helsinki declaration and its later amendments and adhered to the highest data security standards of ISO 27001 and the General Data Protection Regulation (GDPR) act.

Total quality management
The project was carried out under the Total Quality Management (TQM) methodology, which ensures the quality of results and analyses the risk and possible difficulties. TQM requires defining all critical points of the procedures: reference ranges for collected biological material, its preparation, isolation, DNA concentration and quality, genomic sequencing, and quality control of the data. The legal and ethical transparency of the entire project was ensured, including confidentiality, integrity, and impartiality of the data.

Whole genome sequencing
The whole genomes of 1076 unrelated participants were sequenced in this study. 4ml of K-EDTA peripheral blood from participants were collected according to a standardised Quality Management System protocol. Genomic DNA was isolated from the peripheral blood leukocytes using a QIAamp DNA Blood Mini Kit, Blood/Cell DNA Mini Kit (Syngen) and Xpure Blood Kit (A&A Biotechnology) according to the manufacturer's protocols. The concentration and purity of isolated DNA were measured using the NanoDropTM spectrophotometer, and the quality of the DNA was evaluated using gel electrophoresis. The sequencing library was prepared by Macrogen Europe (Amsterdam, the Netherlands) using TruSeq DNA PCR-free kit (Illumina Inc., San Diego, California, United States) and 550 bp inserts. The quality of DNA libraries was measured using 2100 Bioanalyzer, Agilent Technologies. The Whole Genome Sequencing (WGS) was performed on the Illumina NovaSeq 6000 platform using 150 bp paired-end reads, yielding a mean depth of coverage of 35.26X in the cohort.

Pre-processing of whole-genome sequence data
The quality of sequenced reads was assessed using FastQC v0.11.7 (see Web Resources). Thereafter reads were mapped to the GRCh38 human reference genome using Speedseq framework v.0.

Phenotype encoding
The cohort analyzed in this study was divided into multiple categories: control, resistant, benign, mild, and severe. In the analysis of the resistance to COVID-19 infection the resistant group was coded as "1", the control group was removed from the analysis, and all the other groups were coded as "0". In the analysis of the susceptibility to severe COVID-19 infection the severe group was coded as "1", the control group was removed from the analysis, and all the other groups were coded as "0".

Genome-wide association study and SNP functional annotation
To include the information on the covariance between SNPs, which is mainly due to the linkage disequilibrium, a mixed linear model simultaneously fitting additive effects of all SNPs was used: where y represents a vector of binarised disease status, β is a vector of fixed effects comprising: a general mean, age at data sampling [years] and gender with a design matrix X, g is a vector of random additive SNP effects with the pre-imposed normal distribution given by g~N(0, G) and the corresponding design matrix Z containing SNP genotype codes parameterised as 0, 1, or 2 for a homozygous, heterozygous and an alternative homozygous SNP genotype, respectively, ε represents the residual effect with a pre-imposed normal distribution given by ε~N(0, R). The model covariance structure was thus defined by the covariance matrix of SNP effects, G ¼ Is 2 g and the covariance matrix of the residual effects expressed as R ¼ Is 2 ε , which resulted in the covariance of y given by ZGZ t +R. In our study, both variance components were assumed as known and set to:

Significant SNP-sets
The original number of the called SNPs amounted to 43,469,928, while, after filtering, 15,489,173 SNPs remained for GWAS. The GWAS analysis demonstrated a large number of significant SNPs and associated genes (Fig 1). In the case of patients that suffered from a severe course of COVID-19, 18,191 SNPs were identified as significant, while in the case of resistant patients, 11,799 SNPs were significant. The majority of significant SNPs were located in introns (55.77% for severe phenotype and 59.83% for resistant phenotype), followed by intergenic regions (36.55% for severe phenotype and 32.92% for resistant phenotype) and regulatory regions (5.41% for severe phenotype and 4.69% for resistant phenotype). The occurrence of SNPs in coding regions was below 1%. Among them, the majority of variants were synonymous (60.56% for severe phenotype and 58.46% for resistant phenotype) (Fig 2).
Missense and synonymous SNPs were annotated to genes and subjected to the enrichment analysis, which however did not reveal any significantly overrepresented functional features. The lists of genes harbouring the significant SNPs for severe and resistant phenotypes were summarised in the S1 and S2 Tables, respectively.

COVID-19 related significant SNPs
The most significant SNP (rs2101196) for severe outcome was located on chromosome 1, in the intron of adherens junctions-associated protein 1 gene (AJAP1) while the SNP (rs7772946) that was the most significant for resistance towards COVID-19 was located on chromosome 6, in the intron of FIG4 Phosphoinositide 5-Phosphatase (FIG4). Altogether, the AJAP1 gene was marked by nine significant, intronic SNPs and FIG4 by as many as 41 SNPs located in the intron (Fig 3). However, while considering potential causal mutations, we would like to raise attention to two missense variants that pointed at genes with functions that may potentially influence the risk of a severe infection outcome: (i) a variant located in an exon of the peptidase M20 domain containing 1 gene (PM20D1) and (ii) a variant located in an exon of the low-density lipoprotein receptor-related protein 4 gene (LRP4) (Fig 4, Table 1).

Discussion
COVID-19 infections pose a serious global health concern hence it is crucial to identify biomarkers for the susceptibility and resistance against this disease that could help in a rapid risk assessment and reliable decisions on patients' treatment and potential hospitalisation. Moreover, COVID-19 manifests a broad spectrum of clinical signs ranging from an asymptomatic to a severe course . Another study reported 13 genome-wide significant loci associated with COVID-19 infection and severity (rs2271616, rs10490770, rs11919389, rs1886814, rs72711165, rs912805253, rs10774671, rs1819040, rs77534576, rs2109069, rs74956615, rs4801778, rs13050728). Some of these variants overlap with previously reported associations In our study, we incorporated all SNPs identified using WGS, together in one statistical model with a normal distribution pre-imposed on their effects. By a simultaneous inclusion of all SNPs into the model we accounted for the intercorrelation between them, which was due to LD, which technically was inserted into the model by the SNP genotype design matrix. Therefore, we aimed to select the set of variants that significantly influence the variation in genomes of people who suffered from severe COVID-19 infection and those who never got infected despite exposure (i.e. resistant), focusing on potential causal mutations represented by SNPs located within coding or regulatory regions with high, but also moderate effects. Since the effects of causal variants on phenotypes underlying a complex mode of inheritance are typically moderate to small and many variants are in LD, usage of the whole-genome sequence data and statistical analysis that fits variants simultaneously provides a more realistic statistical handling of the phenotype [50]. As demonstrated in Fig 5, especially for the severe phenotype, as compared to conventional, single-SNP GWAS, the multi-SNP model also identifies SNPs with lower P values, corresponding to lower effects on the response to infection. In general,

PLOS ONE
our analysis identified the number of SNPs significant for severe and resistant phenotypes. The patterns in distribution and type of mutations for these two phenotypes were very similar (Fig 1). The majority of genetic variants were located in introns (over 50%), followed by intergenic regions (over 30%) and a small proportion of variants in regulatory regions (4.5-5%). Interestingly, although the majority of variants were synonymous, there was a considerable number of missense variants, both for severe and resistant phenotypes (39.44% and 41.54%, respectively).
The most significant SNP (rs2101196) for severe outcome was identified on chromosome 1, in the intron of AJAP1 (Fig 3). AJAP1 is a membrane protein that blocks epithelial-to-mesenchymal transition through interaction with β-catenin and inhibiting its nuclear translocation. Overexpression of AJAP1 may reduce metastasis of hepatocellular carcinoma [51]. In the case of the resistance towards COVID-19, the most significant SNP (rs7772946) was located on chromosome 6, in the intron of FIG4 (Fig 3). Variants in the FIG4 gene lead to obstruction of endocytic trafficking, causing the accumulation of enlarged vesicles in murine peripheral neurons and fibroblasts. Therefore, some variants in FIG4 are related to neurological disorders, such as Charcot-Marie-Tooth disease and Yunis-Varón syndrome [52,53]. However, none of the above associations seems to be important in the case of COVID-19 related phenotypes.
However, by estimating the effects of all SNPs simultaneously, we mainly focussed on identifying SNPs with potential causal effects, located in coding regions, that may often be missed due to their lower effect on phenotypes as compared to non-coding SNPs being in high LD with such causal mutations. Such SNPs located in coding regions may not list as the top-significant variants because their consequence on phenotypes suppresses their genotypic variability. This typically results in a low frequency of the detrimental allele and hence of the corresponding genotypes in the population. In our sample, no significant missense variants were identified for the resistance to COVID-19, however, two missense variants that pointed at genes

PLOS ONE
with functions that may potentially influence the risk of a severe infection outcome were determined (Table 1). One variant was located in the exon of the peptidase M20 domain containing 1 gene (PM20D1) and this particular SNP was associated with the QT interval [54]. QT interval captures the time it takes for the cardiac ventricles to depolarize and repolarize. It was shown

PLOS ONE
that azithromycin and hydroxychloroquine, drugs used in the treatment of COVID-19 could prolong the QT interval on the electrocardiogram, increasing the risk of tachycardia and sudden cardiac death [55]. Other diseases associated with the increased frequency of the PM20D1 variant are diabetes and obesity [56], and neurodegenerative diseases, such as Parkinson's Disease [57] and Alzheimer's Disease [58]. Interestingly two variants in the PM20D1 have been associated with severe COVID-19 courses requiring ventilatory support and have been located in an epigenome-wide association study [59]. The second missense variant was located in the exon of the low-density lipoprotein receptor-related protein 4 gene (LRP4) and two phenotypes, potentially related to COVID-19, were determined for it. This particular SNP was associated with D-dimer levels [60]. COVID-19 is a condition associated with elevated D-dimer. Several studies confirmed this association and showed that D-dimer is elevated in patients with severe COVID-19. The highest D-dimer level was observed in most critically ill patients and those who did not survive [4,[61][62][63]. This suggests that D-dimer could be a predictor of serious illness and death due to COVID-19. Moreover, this missense variant was also related to postsynaptic congenital myasthenic syndrome [64]. Patients affected by neuromuscular disorders are at higher risk for severe COVID-19 due to respiratory and swallowing muscle weakness [65].
None of identified significant SNPs (in the intron of AJAP1 and the intron of FIG4) and none of the missense variants with a potential causal function on a severe outcome of COVID-19 (in the exon of PM20D1 and the exon of LRP4), found in our study, have been previously associated with COVID-19. It is not surprising, having in mind a generally low overlap between COVID-19 related GWAS. Still, among 1820 genes for severe and 904 genes for

PLOS ONE
resistance marked by SNPs significant in our study, sequentially 20 and 33 have been previously reported, but none have been associated with COVID-19. At this point, we can only hypothesise that the above-mentioned gene discoveries were possible due to the application of the multi-SNP approach, but more dedicated methodological research is needed to (dis)prove the hypothesis.

Conclusions
Although there are several already established risk factors of COVID-19, such as advanced age, being a male, or coexisting comorbidities, they do not fully explain the wide spectrum of disease outcomes. Undoubtedly, genetic factors also play an important role. Our study presented a solution for searching genes responsible for the increased susceptibility or resistance to COVID-19 infections considering not only high-risk variants but also SNPs with moderate effects, reflecting the possibly oligogenic mode of inheritance of the infection phenotypes. So far, several genetic variants related to ancestry, virus entry, receptors, and immunological processes have been associated with susceptibility and resistance to  In this study, the most significant, albeit non-causal, SNPs pointed at the AJAP1 gene for a severe outcome (blocking epithelial-to-mesenchymal transition) and the FIG4 gene for resistance (accumulation of enlarged vesicles in murine peripheral neurons and fibroblasts), while potentially causal, missense variants pointed at two genes with functions that are promising to influence the risk of a severe outcome, namely: missense change in LRP4 gene could be associated with D-dimer levels and postsynaptic congenital myasthenic syndrome, while missense change in PM20D1 might be associated with the QT interval.
It's already been over two years with COVID-19, accumulating data, both clinical and genomic, but we are only at the very beginning of our understanding of the genetic mechanisms underlying the infection. Having huge databases created during the pandemic, now the time has come for more in-depth research to be performed to understand various outcomes of COVID-19 infection and to understand their mode of inheritance, that in view of the relatively low overlap in literature reports published so far, is expected to be of a multi/oligo -genic rather than monogenic.
One of the most limiting factors in applying population-based GWAS to large genomes is the sample size. From the statistical perspective, insufficient sample size leads to elevated type I and type II errors. For genomes composed of millions of base pairs, a large sample size is limited by sequencing costs. Two approaches to enlarge sample sizes emerge: (i) to use a cohort sequenced with high coverage for imputation of variants in a larger cohort that is only genotyped by a microarray [66], (ii) using a cohort sequenced at a moderate coverage without a need of imputation. Both approaches are associated with limitations. Although very broadly used, the former approach does not allow for accurate detection of rare, patient-specific mutations [67], which is not reflected by overall imputation accuracies. The latter approach, which was adopted in our study, is associated with the risk of non-detecting an existing variant due to missing coverage of some nucleotides [68]. Since the available material achieved a moderate genome averaged coverage of over 35X and it is currently the largest sequenced cohort representing the Polish population, the imputation was not considered in our study.
Moreover, in humans, the sample size is also limited by the number of available probands, especially for rare diseases or for disorders that are difficult to quantify-that in the case of our study was the assessment of COVID-19 severity, that could only be conducted in combination to hospital examination and conducting the follow of the environmental (i.e. mainly family) history of each patient.
Supporting information S1